ANOVA (One-Way): Analysis of Variance from Scratch (NumPy)#
ANOVA answers a simple question:
Do these groups look like they come from populations with the same mean?
It’s the standard test when you want to compare the mean of a numeric outcome across 3+ independent groups (A/B/n experiments, multiple treatments, different customer segments, …) without doing many pairwise t-tests.
Learning goals#
By the end you should be able to:
state the null/alternative hypotheses for one-way ANOVA
compute the ANOVA pieces (SSB, SSW, degrees of freedom, F) using only NumPy
interpret the F statistic, p-value, and a simple effect size (η²)
visualize what “between-group vs within-group variation” means
understand the main assumptions and what to check when they’re violated
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) When to use one-way ANOVA#
Use one-way ANOVA when:
your outcome is numeric (e.g., revenue, time, blood pressure)
you have one categorical factor with k levels (e.g., treatment A/B/C)
the samples are independent (different people/items in each group)
Hypotheses#
With groups (1, \dots, k):
Null (H_0): all group means are equal
(\mu_1 = \mu_2 = \cdots = \mu_k)
Alternative (H_1): at least one mean differs
If you only have two groups, one-way ANOVA is equivalent to a two-sample t-test:
(F = t^2).
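This equivalence is easy to check numerically. A quick sketch with two simulated groups (illustrative sizes and means; the t-test uses a pooled variance to match classical ANOVA's equal-variance assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.5, 1.0, size=30)

# Pooled two-sample t statistic
n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# One-way ANOVA F for the same two groups
grand = np.concatenate([a, b]).mean()
ss_between = n1 * (a.mean() - grand) ** 2 + n2 * (b.mean() - grand) ** 2
ss_within = np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)
F = (ss_between / 1) / (ss_within / (n1 + n2 - 2))

print(t ** 2, F)  # the two numbers agree up to floating-point error
```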
2) Why not run many pairwise t-tests?#
If you compare k groups with pairwise t-tests, you do
[ \text{#comparisons} = \frac{k(k-1)}{2} ]
Even if all means are actually equal, doing many tests makes it likely that at least one comparison looks “significant” just by chance.
A rough (independence) approximation of the family-wise false-positive probability is:
[ \Pr(\text{≥1 false positive}) \approx 1 - (1 - \alpha)^{m} ]
where (m) is the number of comparisons.
alpha = 0.05
k = np.arange(2, 21)
m = k * (k - 1) // 2
fwer = 1 - (1 - alpha) ** m
fig = px.line(
x=k,
y=fwer,
markers=True,
title="Type I error inflation from many pairwise tests (rough illustration)",
labels={"x": "Number of groups (k)", "y": "Pr(at least one false positive)"},
)
fig.add_hline(
y=alpha,
line_dash="dash",
line_color="gray",
annotation_text=f"single-test α={alpha}",
annotation_position="bottom right",
)
fig.update_yaxes(range=[0, 1])
fig.show()
3) A concrete example dataset#
We’ll create three independent groups with the same within-group noise but different means.
Later, we’ll compute ANOVA by explicitly splitting variance into:
between-group variation (how far group means are from the overall mean)
within-group variation (how spread out points are inside each group)
group_names = np.array(["A", "B", "C"])
n_per_group = 40
sigma = 1.0
means_alt = np.array([0.0, 0.3, 0.9])
groups_alt = [rng.normal(loc=m, scale=sigma, size=n_per_group) for m in means_alt]
values_alt = np.concatenate(groups_alt)
labels_alt = np.repeat(group_names, n_per_group)
df_alt = pd.DataFrame({"group": labels_alt, "value": values_alt})
df_alt.head()
|   | group | value |
|---|---|---|
| 0 | A | 0.304717 |
| 1 | A | -1.039984 |
| 2 | A | 0.750451 |
| 3 | A | 0.940565 |
| 4 | A | -1.951035 |
overall_mean = df_alt["value"].mean()
means_by_group = df_alt.groupby("group")["value"].mean().reindex(group_names)
fig = px.violin(
df_alt,
x="group",
y="value",
color="group",
box=True,
points="all",
title="Example data: distributions by group",
)
fig.add_hline(
y=overall_mean,
line_dash="dash",
line_color="black",
annotation_text=f"overall mean = {overall_mean:.2f}",
annotation_position="top left",
)
fig.add_trace(
go.Scatter(
x=group_names,
y=means_by_group.values,
mode="markers",
marker=dict(size=12, color="black", symbol="x"),
name="group means",
)
)
fig.show()
4) The core idea: decompose the variance#
Let (y_{ij}) be observation (j) in group (i).
Group mean: (\bar{y}_i)
Overall mean: (\bar{y})
Sums of squares#
Total variation around the overall mean:
[ SS_T = \sum_i \sum_j (y_{ij} - \bar{y})^2 ]
Between-group variation (how far group means are from the overall mean):
[ SS_B = \sum_i n_i(\bar{y}_i - \bar{y})^2 ]
Within-group variation (how spread out points are inside each group):
[ SS_W = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2 ]
These satisfy:
[ SS_T = SS_B + SS_W ]
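The identity is worth verifying once on a tiny hand-made dataset (numbers chosen arbitrarily for illustration):

```python
import numpy as np

groups = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]), np.array([5.0, 5.0, 8.0])]
values = np.concatenate(groups)
grand = values.mean()

ss_total = np.sum((values - grand) ** 2)                          # SS_T
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # SS_B
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)        # SS_W

print(ss_total, ss_between + ss_within)  # equal up to floating-point error
```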
From sums of squares to the F statistic#
Degrees of freedom:
(df_B = k - 1)
(df_W = N - k)
Mean squares:
(MS_B = SS_B / df_B)
(MS_W = SS_W / df_W)
Test statistic:
[ F = \frac{MS_B}{MS_W} ]
Intuition:
under (H_0) (equal means), both (MS_B) and (MS_W) estimate the same noise variance ⇒ (F) tends to be near 1
when group means are far apart compared to within-group noise ⇒ (MS_B) grows ⇒ (F) gets large
def one_way_anova_from_scratch(groups):
"""One-way ANOVA (independent groups) computed with NumPy only.
Returns a dict with the main ANOVA quantities and a simple effect size.
"""
groups = [np.asarray(g, dtype=float).ravel() for g in groups]
if len(groups) < 2:
raise ValueError("Need at least 2 groups")
n_i = np.array([len(g) for g in groups], dtype=int)
if np.any(n_i < 2):
raise ValueError("Each group should have at least 2 observations")
k = len(groups)
n = int(n_i.sum())
means_i = np.array([g.mean() for g in groups])
overall_mean = float(np.sum(n_i * means_i) / n)
ss_between = float(np.sum(n_i * (means_i - overall_mean) ** 2))
ss_within = float(np.sum([np.sum((g - m) ** 2) for g, m in zip(groups, means_i)]))
ss_total = ss_between + ss_within
df_between = k - 1
df_within = n - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within
eta_sq = ss_between / ss_total # proportion of variance explained by group membership
return {
"k": k,
"n": n,
"group_sizes": n_i,
"group_means": means_i,
"overall_mean": overall_mean,
"SS_between": ss_between,
"SS_within": ss_within,
"SS_total": ss_total,
"df_between": df_between,
"df_within": df_within,
"MS_between": ms_between,
"MS_within": ms_within,
"F": float(f_stat),
"eta_sq": float(eta_sq),
}
res_alt = one_way_anova_from_scratch(groups_alt)
print(f"F({res_alt['df_between']}, {res_alt['df_within']}) = {res_alt['F']:.4f}")
print(f"η² (eta squared) = {res_alt['eta_sq']:.4f}")
print()
print("Group means:", dict(zip(group_names, np.round(res_alt["group_means"], 4))))
print(
f"SS_between + SS_within = {res_alt['SS_between'] + res_alt['SS_within']:.4f} "
f"(SS_total = {res_alt['SS_total']:.4f})"
)
F(2, 117) = 6.5086
η² (eta squared) = 0.1001
Group means: {'A': 0.0383, 'B': 0.3132, 'C': 0.6674}
SS_between + SS_within = 79.4740 (SS_total = 79.4740)
# Visual: how SST splits into between vs within
sst = res_alt["SS_total"]
fig = go.Figure()
fig.add_trace(go.Bar(name="Between groups (SSB)", x=["SST"], y=[res_alt["SS_between"] / sst]))
fig.add_trace(go.Bar(name="Within groups (SSW)", x=["SST"], y=[res_alt["SS_within"] / sst]))
fig.update_layout(
barmode="stack",
title=f"Variance decomposition (η² = SSB/SST = {res_alt['eta_sq']:.3f})",
yaxis_title="Proportion of total variation (SST)",
showlegend=True,
)
fig.update_yaxes(range=[0, 1])
fig.show()
# Visual: group means with a simple 95% CI (normal approximation)
summary = (
df_alt.groupby("group")["value"]
.agg(["mean", "std", "count"])
.reindex(group_names)
.reset_index()
)
summary["se"] = summary["std"] / np.sqrt(summary["count"])
summary["ci95"] = 1.96 * summary["se"]
fig = go.Figure(
data=[
go.Scatter(
x=summary["group"],
y=summary["mean"],
error_y=dict(type="data", array=summary["ci95"], visible=True),
mode="markers",
marker=dict(size=12),
)
]
)
fig.update_layout(
title="Group means with rough 95% CIs (for intuition)",
xaxis_title="Group",
yaxis_title="Mean outcome",
)
fig.show()
5) Getting a p-value with NumPy only (permutation test)#
Classical ANOVA reports a p-value from the F distribution.
If you want a p-value using only NumPy, you can use a permutation test:
Compute the observed (F) on the real grouped data.
Break the link between values and group labels by repeatedly shuffling observations.
Recompute (F) each time to build a null distribution of (F) values.
The p-value is the fraction of shuffled (F) values that are ≥ the observed (F).
This answers:
“If group membership didn’t matter, how often would we see an F this large just by chance?”
def permutation_test_anova(groups, n_permutations=5000, rng=None):
"""Permutation test for one-way ANOVA using only NumPy.
Notes:
- Keeps group sizes fixed.
- p-value is Monte Carlo estimated; increase n_permutations for accuracy.
"""
if rng is None:
rng = np.random.default_rng(0)
groups = [np.asarray(g, dtype=float).ravel() for g in groups]
sizes = np.array([len(g) for g in groups], dtype=int)
values = np.concatenate(groups)
observed = one_way_anova_from_scratch(groups)["F"]
null_f = np.empty(n_permutations, dtype=float)
split_points = np.cumsum(sizes)[:-1]
for b in range(n_permutations):
perm = rng.permutation(values)
perm_groups = np.split(perm, split_points)
null_f[b] = one_way_anova_from_scratch(perm_groups)["F"]
p_value = (np.sum(null_f >= observed) + 1) / (n_permutations + 1)
return {"F_observed": float(observed), "p_value": float(p_value), "F_null": null_f}
perm_res = permutation_test_anova(groups_alt, n_permutations=5000, rng=rng)
print(f"Observed F = {perm_res['F_observed']:.4f}")
print(f"Permutation p-value ≈ {perm_res['p_value']:.4f} (5000 shuffles)")
Observed F = 6.5086
Permutation p-value ≈ 0.0022 (5000 shuffles)
fig = px.histogram(
x=perm_res["F_null"],
nbins=60,
title="Permutation null distribution of F",
labels={"x": "F under H0 (shuffled groups)", "y": "count"},
)
fig.add_vline(
x=perm_res["F_observed"],
line_color="red",
line_width=3,
annotation_text=f"observed F = {perm_res['F_observed']:.2f}<br>p ≈ {perm_res['p_value']:.3f}",
annotation_position="top right",
)
fig.show()
6) How to interpret ANOVA results#
ANOVA gives you an (F) statistic with two associated degrees of freedom:
numerator: (df_B = k - 1) (how many group means can vary)
denominator: (df_W = N - k) (how many residual degrees of freedom remain)
What a small vs large F means#
Small (F) (≈ 1): between-group variation is about the same size as within-group noise ⇒ group means are not clearly separated.
Large (F): group means are far apart relative to within-group noise ⇒ evidence against (H_0).
What the p-value means#
The p-value is:
The probability (under (H_0)) of seeing an (F) at least as large as the one you observed.
If (p < \alpha) (e.g., 0.05), you reject (H_0) and conclude:
Not all group means are equal.
Important: ANOVA does not tell you which groups differ — that requires post-hoc comparisons (e.g., Tukey HSD) or planned contrasts.
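To give a minimal taste of what a post-hoc comparison looks like, here is a NumPy-only sketch: all pairwise Welch-style t statistics compared against a Bonferroni-corrected critical value. The group setup mirrors the simulation above, and the critical value uses a rough normal approximation rather than the exact t quantile:

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(42)
groups = {"A": rng.normal(0.0, 1.0, 40), "B": rng.normal(0.3, 1.0, 40), "C": rng.normal(0.9, 1.0, 40)}

pairs = list(combinations(groups, 2))
alpha = 0.05
# Bonferroni: test each pair at alpha / m. With m = 3 pairs, the two-sided
# normal-approximation critical value is about 2.39 (rough; assumes large-ish groups).
z_crit = 2.39
for g1, g2 in pairs:
    a, b = groups[g1], groups[g2]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    t = (a.mean() - b.mean()) / se  # Welch-style t statistic
    flag = "differ" if abs(t) > z_crit else "no clear difference"
    print(f"{g1} vs {g2}: t = {t:+.2f} -> {flag}")
```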
# Quick contrast: simulate a "null" case where means are truly equal
means_null = np.array([0.0, 0.0, 0.0])
groups_null = [rng.normal(loc=m, scale=sigma, size=n_per_group) for m in means_null]
res_null = one_way_anova_from_scratch(groups_null)
print(f"Null simulation: F({res_null['df_between']}, {res_null['df_within']}) = {res_null['F']:.4f}")
Null simulation: F(2, 117) = 0.5630
7) Assumptions and quick diagnostics#
One-way ANOVA is most trustworthy when:
Independence: observations are independent (design issue; not a plot).
Normal-ish residuals: within each group, values are roughly normal (ANOVA is fairly robust with decent sample sizes).
Equal variances (homoscedasticity): within-group variances are similar.
Below are quick visual checks you can run before/after the test.
# Residuals: y - group_mean
fitted_alt = np.concatenate([np.full(len(g), g.mean()) for g in groups_alt])
residuals_alt = values_alt - fitted_alt
# 1) Residuals vs fitted
fig = px.scatter(
x=fitted_alt,
y=residuals_alt,
color=labels_alt,
title="Residuals vs fitted (group means)",
labels={"x": "fitted value (group mean)", "y": "residual"},
)
fig.add_hline(y=0, line_dash="dash", line_color="gray")
fig.show()
# 2) Residual distribution by group
df_res = pd.DataFrame({"group": labels_alt, "residual": residuals_alt})
fig = px.box(
df_res,
x="group",
y="residual",
points="all",
color="group",
title="Residuals by group (look for similar spread)",
)
fig.add_hline(y=0, line_dash="dash", line_color="gray")
fig.show()
# 3) Compare sample variances directly
vars_by_group = df_alt.groupby("group")["value"].var(ddof=1).reindex(group_names)
fig = px.bar(
x=group_names,
y=vars_by_group.values,
title="Sample variance by group (homoscedasticity check)",
labels={"x": "group", "y": "sample variance"},
)
fig.show()
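As a numeric companion to these plots, the Brown–Forsythe variant of Levene's test is just one-way ANOVA applied to the absolute deviations from each group's median. A NumPy-only sketch (simulated groups with equal spreads, for illustration; values of F far above 1 would suggest unequal spreads):

```python
import numpy as np

rng = np.random.default_rng(7)
groups = [rng.normal(0.0, s, size=40) for s in (1.0, 1.0, 1.0)]

# Brown-Forsythe: ANOVA on |y - group median| tests equality of spreads
abs_dev = [np.abs(g - np.median(g)) for g in groups]

n_i = np.array([len(d) for d in abs_dev])
means_i = np.array([d.mean() for d in abs_dev])
grand = np.concatenate(abs_dev).mean()
ss_b = np.sum(n_i * (means_i - grand) ** 2)
ss_w = sum(np.sum((d - m) ** 2) for d, m in zip(abs_dev, means_i))
k, n = len(abs_dev), int(n_i.sum())
F_levene = (ss_b / (k - 1)) / (ss_w / (n - k))

print(f"Brown-Forsythe F = {F_levene:.3f}")
```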
8) Practical usage (SciPy) + sanity check#
In practice you’ll usually compute the parametric p-value from the F distribution.
If SciPy is available, this cell cross-checks the from-scratch (F) and shows the classical p-value.
try:
from scipy import stats
f_scipy, p_scipy = stats.f_oneway(*groups_alt)
print(f"SciPy f_oneway: F = {f_scipy:.4f}, p = {p_scipy:.6f}")
print(f"From scratch: F = {res_alt['F']:.4f}")
except Exception as e:
print("SciPy not available (or failed to import).")
print("Error:", repr(e))
SciPy f_oneway: F = 6.5086, p = 0.002088
From scratch: F = 6.5086
9) Common pitfalls#
Rejecting (H_0) doesn’t tell you which groups differ: do post-hoc tests / planned contrasts.
Non-independence breaks the test: repeated measurements need repeated-measures ANOVA / mixed models.
Unequal variances: consider Welch’s ANOVA (robust to heteroscedasticity).
Heavy tails / strong non-normality: consider Kruskal–Wallis (rank-based) or a permutation approach.
Multiple factors / interactions: use two-way ANOVA or linear models.
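As one example, Welch's ANOVA drops the equal-variance assumption by weighting each group by (n_i / s_i^2). A NumPy-only sketch of the statistic following Welch's 1951 formulation (for real work, prefer a vetted library implementation; the simulated data here has equal means but deliberately unequal variances):

```python
import numpy as np

def welch_anova_f(groups):
    """Welch's heteroscedasticity-robust one-way ANOVA statistic (sketch)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_i = np.array([len(g) for g in groups])
    m_i = np.array([g.mean() for g in groups])
    v_i = np.array([g.var(ddof=1) for g in groups])
    w_i = n_i / v_i                        # precision weights
    w = w_i.sum()
    m_w = np.sum(w_i * m_i) / w            # weighted grand mean
    a = np.sum(w_i * (m_i - m_w) ** 2) / (k - 1)
    tmp = np.sum((1 - w_i / w) ** 2 / (n_i - 1))
    b = 1 + 2 * (k - 2) / (k ** 2 - 1) * tmp
    df2 = (k ** 2 - 1) / (3 * tmp)         # approximate denominator df
    return a / b, k - 1, df2

rng = np.random.default_rng(1)
groups = [rng.normal(0.0, s, size=40) for s in (0.5, 1.0, 3.0)]
f_w, df1, df2 = welch_anova_f(groups)
print(f"Welch F = {f_w:.3f}, df = ({df1}, {df2:.1f})")
```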
10) Exercises#
Simulate data where group means are equal but variances differ strongly. What happens to (F)?
Increase the group mean separation while keeping (\sigma) fixed. How does (F) change?
Increase the sample size per group while keeping mean separation fixed. What happens to the permutation p-value?
Implement a simple post-hoc procedure:
all pairwise mean differences
a Bonferroni-corrected threshold for significance
References#
NIST Engineering Statistics Handbook — One-way ANOVA
Montgomery, Design and Analysis of Experiments
Gelman et al., Bayesian Data Analysis (for model-based perspective)